Smith et al. Genome Biology 2014, 15:R22 
http://genomebiology.com/2014/15/2/R22






RESEARCH 



Open Access 



Genetic ancestry of participants in the National 
Children's Study 






Erin N Smith, Kristen Jepsen, Angelo D Arias, Peter J Shepard, Christina D Chambers
and Kelly A Frazer



Abstract 

Background: The National Children's Study (NCS) is a prospective epidemiological study in the USA tasked with 
identifying a nationally representative sample of 100,000 children, and following them from their gestation until 
they are 21 years of age. The objective of the study is to measure environmental and genetic influences on growth, 
development, and health. Determination of the ancestry of these NCS participants is important for assessing the 
diversity of study participants and for examining the effect of ancestry on various health outcomes. 

Results: We estimated the genetic ancestry of a convenience sample of 641 parents enrolled at the 7 original NCS 
Vanguard sites, by analyzing 30,000 markers on exome arrays, using the 1000 Genomes Project superpopulations as 
reference populations, and compared this with the measures of self-reported ethnicity and race. For 99% of the 
individuals, self-reported ethnicity and race agreed with the predicted superpopulation. NCS individuals self-reporting 
as Asian had genetic ancestry of either South Asian or East Asian groups, while those reporting as either Hispanic White 
or Hispanic Other had similar genetic ancestry. Of the 33 individuals who self-reported as Multiracial or Non-Hispanic 
Other, 33% matched the South Asian or East Asian groups, while these groups represented only 4.4% of the other 
reported categories. 

Conclusions: Our data suggest that self-reported ethnicity and race have some limitations in accurately capturing 
Hispanic and South Asian populations. Overall, however, our data indicate that despite the complexity of the US 
population, individuals know their ancestral origins, and that self-reported ethnicity and race is a reliable indicator of 
genetic ancestry. 



Background 

The major goal of the National Children's Study (NCS), 
authorized by the US Congress through the Child Health 
Act of 2000, is to discover and characterize environmen- 
tal exposures that contribute to causation of disease or, 
conversely, that enhance children's health (Children's 
Health Act of 2000, Public Law 106-310 Sec. 1004). The 
pilot phase of the NCS, known as the Vanguard Study, is 
a small-scale study, using convenience sampling, which 
is being conducted to evaluate the feasibility, acceptabil- 
ity, and costs of the methods that will be used to carry 
out the main study. The Vanguard Study began in 2009 
with a total of 7 locations, or Vanguard Centers, and
grew to include 40 sites. 

Race and ethnicity are associated with environmental 
risk factors for disease [1], such as tobacco smoke [2], 
air quality [3], and food environments [4]. While the re- 
lationship between race and genetics has been conten- 
tious [5,6], it is clear that genetic factors associated with 
disease can vary with racial background, resulting in 
common disease loci differing between ancestral groups 
[7]. These environmental and genetic differences could 
result in ethnicity and race being associated with various 
health outcomes, such as cancer treatment [8] and toxi- 
cology [9], and are therefore important to consider in 
large epidemiological studies of environmental influ- 
ences on development, such as the NCS. 

Efforts over the past 5 years to genotype human populations
have shed light on human genetic diversity, human
population evolution, and migration patterns [10-12]. In
recent efforts to comprehensively identify the majority of
common variation (>1%) in worldwide populations, the
1000 Genomes Project (1KG) [13] is sequencing 2,500
individuals from 25 world populations. Currently, genotype
array data are available for 21 of these populations, which
are classified into 5 superpopulations: African (AFR),
Admixed American (AMR), East Asian (ASN), European
(EUR), and South Asian (SAN). Advances in genotyping
technologies have yielded cost-effective tools, such as the
$54/sample Illumina HumanExome Array, allowing for
genotyping of ancestry informative markers, the majority
of genome-wide association study (GWAS)-associated
loci, and rare coding variants potentially associated with
disease.

Our goal was to determine whether self-reported race 
and ethnicity was concordant with genetic ancestry in a 
sample of representative American counties. Because in- 
dividuals can be descended from diverse ancestries that 
may not be well captured in census categories or may 
not be similar to genetic reference groups, we aimed to 
identify potential systematically misclassified groups to 
guide downstream questionnaires and genetic assays. 

To this aim, we examined whether self-reported ethni- 
city and race accurately assesses the genetic ancestry of 
participants in the NCS. DNA from 641 NCS-enrolled
parents from 7 Vanguard sites was successfully assayed 
using exome arrays, and 29,972 markers were used for 
ancestry estimation. We compared the genetic profiles 
of the NCS participants with those of the reference pop- 
ulations, and determined for each individual whether 
self-reported race and ethnicity was consistent with their 
most similar 1KG superpopulation. We also examined
race and ethnicity categories for which we were unable 
to predict a match, such as Multiracial, and used the 
genetic predictions to infer population groups that may 
not be adequately captured by the current race and eth- 
nicity categories. 

Results and discussion 

Using face-to-face interviews, self-reported race and 
ethnicity information was collected from 645 partici- 
pants from 7 counties. Questionnaire responses allowed 
for two ethnicities (Hispanic or Non-Hispanic) and six 
race categories (Black or African American, American 
Indian or Alaska Native, Asian, Native Hawaiian or 
Other Pacific Islander, White, and Some Other Race), 
and multiple categories could be picked (Multiracial). 

DNA from whole blood was isolated and genotyped 
using the Illumina HumanExome Array, with 641 sam- 
ples passing quality control criteria. The HumanExome 
Array was designed with approximately 3,000 ancestry 
informative markers that distinguish between European 
and African American ancestry, and 1,000 markers that 
distinguish between European and Native American
ancestry. Additional content included sites that could
vary by population, but that were not chosen for ances- 
try informativeness, such as GWAS single nucleotide 
polymorphisms (SNPs), coding variation, randomly se- 
lected synonymous sites, and human leukocyte antigen 
(HLA) tags. To identify all sites that were informative 
for ancestry, we calculated informativeness [14] to dis- 
tinguish between the 5 superpopulation groups of the 
1KG Project and identified around 30,000 sites with
positive informativeness. 
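
As an illustration of how a per-marker informativeness score of this kind can be computed, the following minimal Python sketch assumes an I_n-style "informativeness for assignment" statistic. This section does not state which statistic from [14] was used, so the formula, the function name `informativeness`, and the example allele frequencies are assumptions rather than the authors' implementation.

```python
import math


def informativeness(freqs_by_pop):
    """I_n-style informativeness for one marker (assumed statistic).

    freqs_by_pop holds one allele-frequency list per population, e.g.
    [[0.9, 0.1], [0.2, 0.8]] for a biallelic marker in two populations.
    """
    k = len(freqs_by_pop)
    n_alleles = len(freqs_by_pop[0])
    total = 0.0
    for j in range(n_alleles):
        # Mean frequency of allele j across the K populations.
        mean_p = sum(pop[j] for pop in freqs_by_pop) / k
        if mean_p > 0:
            total -= mean_p * math.log(mean_p)
        for pop in freqs_by_pop:
            if pop[j] > 0:  # treat 0 * log(0) as 0
                total += (pop[j] / k) * math.log(pop[j])
    return total


# A marker with identical frequencies everywhere carries no ancestry signal,
# while a fixed difference between two populations scores log(2).
uninformative = informativeness([[0.3, 0.7]] * 5)
fixed_difference = informativeness([[1.0, 0.0], [0.0, 1.0]])
```

Under this definition the score is zero when allele frequencies are the same in every group and grows as the groups diverge, which is why keeping sites with positive scores retains every site that differs between groups at all.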

For each NCS participant, we identified the most similar
1KG superpopulation. Using the ancestry informative
SNPs, we clustered the genotypes of the NCS participants
with those of the 1KG participants using multidimensional scaling
(MDS). To identify the most similar superpopulation, we
created a linear discriminant model based on the top 20
dimensions of the MDS, and trained it using the 1KG
data. Then, based on the model, we predicted the most
likely superpopulation for each NCS participant (Table 1).
We additionally performed this analysis using the 21 1KG
populations for which we had data (see Additional file 1:
Table S1).
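
MDS operates on a matrix of pairwise distances between individuals. The sketch below shows one common way to build such a matrix from genotypes, using identity-by-state (IBS) sharing. It is an assumption that this resembles the distance used in the analysis; the function names and the 0/1/2 genotype coding are illustrative, and the exact computation in any given software package may differ.

```python
def ibs_distance(g1, g2):
    """Return 1 minus the IBS similarity between two genotype vectors.

    Genotypes are coded as 0, 1, or 2 copies of an allele. None marks a
    missing call, and markers missing in either sample are skipped.
    """
    shared = 0
    compared = 0
    for a, b in zip(g1, g2):
        if a is None or b is None:
            continue
        shared += 2 - abs(a - b)  # alleles shared IBS at this marker (0 to 2)
        compared += 2
    if compared == 0:
        raise ValueError("no markers genotyped in both samples")
    return 1.0 - shared / compared


def distance_matrix(genotypes):
    """Symmetric matrix of pairwise IBS distances for a list of samples."""
    n = len(genotypes)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = ibs_distance(genotypes[i], genotypes[j])
            matrix[i][j] = d
            matrix[j][i] = d
    return matrix
```

The resulting matrix is what a scaling method would then project into a small number of dimensions, such as the 20 used here.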

For each self-reported race and ethnic stratum, we identified
which 1KG superpopulation(s) we expected the
group to match (Table 1). When multiple superpopulations
were plausible, they were all included as expected
matches. For example, we expected self-reported Hispanic
African Americans to be most similar to either the African
(AFR) or the Admixed American (AMR) 1KG superpopulations.
We did not include those who identified themselves
as Multiracial or Non-Hispanic Other (a total of 33
individuals) in the concordance estimates. For the remaining
NCS participants, we observed high levels of agreement between
estimated genetic ancestry and self-reported ethnicity
and race (Figure 1), with 601/608 (98.8%) concordant calls.
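
The concordance figure can be re-derived from the counts in Table 1. The Python sketch below is a worked check, not the authors' code. The two SAN counts for Non-Hispanic African American and Non-Hispanic American Indian or Alaska Native are taken as 1 each, an assumption chosen so that the rows sum to the reported totals (608 individuals, SAN column total of 15).

```python
ORDER = ("AFR", "EUR", "ASN", "AMR", "SAN")

# (ethnicity, race): (expected superpopulations, counts in ORDER)
TABLE1 = {
    ("Hispanic", "African American"): ({"AFR", "AMR"}, (5, 0, 0, 1, 0)),
    ("Non-Hispanic", "African American"): ({"AFR"}, (29, 0, 0, 0, 1)),  # SAN count assumed
    ("Hispanic", "American Indian or Alaska Native"): ({"AMR"}, (0, 0, 0, 2, 0)),
    ("Non-Hispanic", "American Indian or Alaska Native"): ({"AMR"}, (0, 0, 0, 2, 1)),  # SAN count assumed
    ("Non-Hispanic", "Asian"): ({"ASN", "SAN"}, (0, 0, 12, 0, 9)),
    ("Non-Hispanic", "Native Hawaiian or Other Pacific Islander"): ({"ASN"}, (0, 0, 4, 0, 0)),
    ("Hispanic", "White"): ({"AMR", "EUR"}, (0, 8, 0, 42, 0)),
    ("Non-Hispanic", "White"): ({"EUR"}, (0, 427, 0, 3, 0)),
    ("Hispanic", "Other"): ({"AMR"}, (1, 1, 0, 60, 0)),
}


def concordance(table):
    """Return (concordant, total) over all strata with an expected group."""
    concordant = 0
    total = 0
    for expected, counts in table.values():
        for pop, n in zip(ORDER, counts):
            total += n
            if pop in expected:
                concordant += n
    return concordant, total


concordant, total = concordance(TABLE1)
percent = round(100 * concordant / total, 1)
```

Strata without an expected group (Multiracial and Non-Hispanic Other) are left out of `TABLE1`, mirroring their exclusion from the concordance estimate.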

Clustering can be visualized by plotting the first MDS
components against each other for the 1KG (Figure 1A,B)
and the NCS individuals (Figure 1C,D). Data points were
plotted in the first and second dimensions (Figure 1A,C)
and in the second and third dimensions (Figure 1B,D).
The results showed that AFR, EUR, and ASN superpopulations
are clearly differentiated in the first two dimensions,
while the SAN and AMR groups are overlapping,
reflecting their historical European and East Asian ancestry
(Figure 1A). While the AMR group is broadly distributed,
indicating that some individuals are genetically more
similar to the EUR group and others to either the ASN or
AFR groups, individuals in the SAN group cluster together.
In the second and third dimensions, SAN and
AMR are distinctly identifiable (Figure 1B). NCS individuals
identified as Asian by self-report overlap with both
the SAN and ASN groups. This is expected, as a distinct
racial category for persons of South Asian descent (largely
Indian) was not available as a self-reported race category.

Table 1 Superpopulation classification by self-reported ethnicity and race

| Self-reported ethnicity | Self-reported race | Expected superpopulation (a) | Most similar superpopulation (b): AFR | EUR | ASN | AMR | SAN |
|---|---|---|---|---|---|---|---|
| Hispanic | African American | AFR or AMR | 5 | 0 | 0 | 1 | 0 |
| Non-Hispanic | African American | AFR | 29 | 0 | 0 | 0 | 1* |
| Hispanic | American Indian or Alaska Native | AMR | 0 | 0 | 0 | 2 | 0 |
| Non-Hispanic | American Indian or Alaska Native | AMR | 0 | 0 | 0 | 2 | 1* |
| Non-Hispanic | Asian | ASN or SAN | 0 | 0 | 12 | 0 | 9 |
| Non-Hispanic | Native Hawaiian or Other Pacific Islander | ASN | 0 | 0 | 4 | 0 | 0 |
| Hispanic | White | AMR or EUR | 0 | 8 | 0 | 42 | 0 |
| Non-Hispanic | White | EUR | 0 | 427 | 0 | 3* | 0 |
| Hispanic | Other | AMR | 1* | 1* | 0 | 60 | 0 |
| Non-Hispanic | Other | No prediction | 1 | 1 | 0 | 1 | 1 |
| Hispanic | Multiracial | No prediction | 1 | 1 | 0 | 2 | 0 |
| Non-Hispanic | Multiracial | No prediction | 3 | 9 | 7 | 2 | 3 |
| Unknown | Multiracial | No prediction | 0 | 0 | 0 | 1 | 0 |
| Total | | | 40 | 447 | 23 | 116 | 15 |

(a) Expected groups based on self-report are indicated: African (AFR), European (EUR), East Asian (ASN), Admixed American (AMR) and South Asian (SAN).
(b) Counts of individuals that did not match the expected group are marked with an asterisk; all other counts in rows with an expected group are concordant.

We further investigated individuals that were discordant
with our predictions. Linear discriminant analysis
provides a relative score for how well each individual
matches each group, and we observed that discordant
individuals often matched their second-best superpopulation
prediction. Of the seven discordantly assigned individuals,
six matched their second most likely superpopulation
group, and the remaining one matched their
third most likely group. We also examined our analysis
of the 21 1KG populations (see Additional file 1: Table
S1), and observed that 4 of the 7 discordant individuals
matched a population that was in their best-matched
superpopulation by self-report, even though they were
not placed in that group when the 5 1KG superpopulations
were used for the analysis. This suggests that in
some cases, analyses at population level may be more accurate
for assigning genetic ancestry to an individual
than analyses at superpopulation level. Overall, however,
we observed the same level of concordance using populations
as we did using superpopulations (601/608, 98.8%).
Discordant individuals were not likely to be the result of
misidentified samples, because these individuals were collected
from five of the seven NCS sites, and were not consistent
with swaps within each site (data not shown).

Hispanic White and Hispanic Other self-reported groups
were determined to be of closely related ancestry, with 78%
of Hispanic White and 94% of Hispanic Other predicted to
match the AMR population. However, individuals with a
self-report of Hispanic White were more likely to match
the EUR group (22%) than those reporting Hispanic Other (6%), which
is consistent with individuals that identify as Hispanic having
a heritage that includes European and often, but not always,
Native American ancestry.

While there was no expected population group for the 
33 individuals who reported being Multiracial or Non- 
Hispanic Other, we were able to assign them to their 
most similar superpopulations. As a group, they showed 
great diversity, with individuals matching to each of the 
five superpopulations. Of note, 11 (33%) of these individ- 
uals matched to the ASN or SAN groups, which were 
less represented in the other categories (27/609, 4.4%). 
These data suggest that individuals of South Asian or 
East Asian descent may not adequately be captured by 
the NCS ethnicity and race categories. 

Comparison of reported ethnicity and race with genetic 
ancestry highlighted the difficulties in properly capturing 
this information for individuals from populations with his- 
torical admixture. For the Non-Hispanic Asian population, 
we observed two clearly distinct populations: those closely 
related to the ASN population, which is composed of Han 
Chinese individuals (from Beijing and Southern China), 
Chinese individuals from Denver (CO), Japanese individ- 
uals, and Kinh individuals from Ho Chi Minh City 
(Vietnam); and those closely related to the SAN popula- 
tion, which is a population composed of Gujarati Indian 
individuals from Texas (Figure 1; see Additional file 1). 
While of related ancestry, these two populations can be 
clearly discriminated genetically, and the currently used 
race category of 'Asian' does not adequately distinguish
between individuals of South Asian versus East Asian des- 
cent, highlighting the relevance of using genetically deter- 
mined ancestry rather than self-reported ancestry alone. 

Figure 1 Genetic clustering between participants in the 1000 Genomes Project (1KG) and National Children's Study (NCS). In total, 1,445
unrelated individuals from the 1KG and 641 from the NCS were clustered on genotypic profiles using multidimensional scaling. (A,B) The 1KG
individuals are color-coded by superpopulation and plotted according to their scores on (A) the first two dimensions and (B) the second and
third dimensions. (C,D) NCS participants are color-coded by their expected superpopulation group and plotted according to their scores on (C)
the first two dimensions and (D) the second and third dimensions. Abbreviations: Hisp, Hispanic; NH, Non-Hispanic; Afr, African; Nat., Native; Amer.,
American; OPI, Other Pacific Islander.



A comparison of genetic ancestry to self-reported eth- 
nicity and race for Hispanic individuals determined that 
the genetic ancestry of those choosing the categories of 
Hispanic, White (50 persons) and Hispanic, Other (62 
persons) is largely the same. Individuals choosing His- 
panic, White or Hispanic, Other were most similar to 
the AMR superpopulation (102/112) (composed of 
Colombian individuals in Medellin, Colombia; Mexican 
individuals from Los Angeles, CA; Peruvian individuals 
in Lima, Peru; and Puerto Rican individuals in Puerto 
Rico) [13], with the remaining individuals matching the 
European or African superpopulations. 

Conclusions 

In summary, we have successfully used the Illumina
HumanExome Array to classify NCS participants accurately
into superpopulation ancestry groups, consistent
with self-report. Refinements to self-reported ethnicity
and race options for both the Non-Hispanic Asian and
the Hispanic White/Hispanic Other populations would
result in more accurate determination of the genetic ancestry
of these populations.

Materials and methods 

Study population 

In total, 646 blood samples from parents enrolled in the 
NCS were obtained from the NCS biorepository. The sam- 
ples were collected from seven different NCS Vanguard 
Centers across the USA, including Brookings County, SD 
(which also enrolled participants from Yellow Medicine 
County, MN; Pipestone County, MN; and Lincoln County, 
MN); Duplin County, NC; Montgomery County, PA;
Orange County, CA; Queens, NY; Salt Lake County, UT;
and Waukesha County, WI. A total of 710 mothers and
451 fathers were enrolled, from whom 346 mothers and
300 fathers were selected for the current study. Ethnicity/race
was self-described. Participants were asked during a
face-to-face interview about their ethnicity and race, 
choosing between two ethnicity categories (Hispanic or 
Non-Hispanic), and six race categories (Black or African 
American, American Indian or Alaskan Native, Asian, Na- 
tive Hawaiian or Other Pacific Islander, White, or Some 
Other Race), from which multiple categories could be 
picked (Multiracial). All participants had provided written 
informed consent for the use of these samples and the 
study was approved through the local site and/or the NCS 
federated institutional review boards. It should be noted 
that not all NCS participants consented to providing
biological samples, so the overall diversity of enrollment in 
the NCS may differ slightly from what we report here. 

Of the 646 samples, 346 were from mothers and 300 
were from fathers; 360 individuals were paired partici- 
pants (mother and father), and the remaining individuals 
were a single enrolled parent (166 mothers, 120 fathers). 
Samples consisted of 200 µl of EDTA-treated whole
blood for mothers, and 2.0 ml of acid citrate dextrose 
(ACD)-treated whole blood diluted with 2.0 ml 20% 
DMSO in RPMI medium (4.0 ml total volume) for fa- 
thers. Although blood sample storage methods varied 
between enrolled mothers and fathers, both were suffi- 
cient for the studies described here. 

DNA isolation 

DNA was isolated from 200 µl of the provided whole blood
sample using a QIAcube and the QIAamp DNA Blood
Mini QIAcube Kit (Qiagen, Valencia, CA, USA). The
standard QIAcube isolation program was used, except
that the elution volume was modified from 100 µl to
25 µl.

DNA concentrations ranged from 0.81 to 292 ng/µl
for samples from mothers, and from 2.4 to 97.7 ng/µl
for samples from fathers. For mothers, a mean (± SD)
DNA concentration of 68.2 ± 51.4 ng/µl was obtained.
For fathers, the mean DNA concentration was 30.5 ±
18.3 ng/µl. There was one sample (mother) from which
no DNA could be isolated, resulting in 645 of 646 sam- 
ples with successful isolation of DNA. 

Illumina Infinium HD HumanExome BeadChip assay

Using the Illumina Infinium HD HumanExome BeadChip
Assay, 6 µl of each DNA sample was analyzed.
Samples were processed according to the manufacturer's
specifications. We observed an average SNP call rate of
99.2% per sample. Three samples (all mothers) had SNP
call rates below 90% and therefore failed quality control
(3 of 645, a failure rate of <0.5%). One individual was
subsequently removed due to
high genome-wide similarity (proportion identical by
descent (PI_HAT) approximately 1) to another sample.
Overall, we successfully screened 641 of the 645 DNA
samples.



The HumanExome BeadChip was designed through a 
collaborative effort of multiple academic groups ([15]) in 
order to capture rare and common coding variation. It 
includes over 240,000 variants identified from diverse 
populations, and in addition to coding variants, includes 
ancestry informative markers (n = 3,468), SNPs associ- 
ated with a range of common conditions, such as type 2 
diabetes, cancer, metabolic, and psychiatric disorders 
(n = 4,761 SNPs), and additional sites of scientific inter- 
est. Because the array was designed to capture coding 
variants, it captures only 10% of common variation 
through linkage disequilibrium (r² > 0.8).

HumanExome array processing 

Genotypes were called using GenomeStudio (v2011.1).
Briefly, genotype intensities were reclustered together
across all samples, and the default criterion for genotype
quality (GenCall Score >0.15) was used to filter poorly called
genotypes. Genotypes were converted from Illumina
TOP orientation to genome orientation (b37) using the
HumanExome-12v1_A files generated through the Wellcome
Trust Center for Human Genetics ([16]). Sites reported
as 'Cautious Sites' ([17]) were removed. Sites
were annotated to dbSNP 135 identifiers using The Genome
Analysis Toolkit (GATK) [18].

Data quality control 

Data generated on the Illumina HiScan system were
subjected to three basic quality control measures.
First, assay performance was assessed qualitatively
by visual inspection of the internal
control probes on the array to ensure effective staining,
hybridization, base extension, and washing. Second, samples
were required to have a call rate of 90% or greater (three
samples failed). Finally, to identify potential sample
misidentification, the reported sex was compared with the
sex calculated from homozygosity estimates across all
SNPs on the X chromosome with a minor allele frequency
(MAF) >0.1 (no samples were assigned the wrong sex).
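
The two per-sample checks described above (call rate and sex concordance) can be sketched in Python as follows. This is an illustrative sketch, not the pipeline used in the study: the function names, the 0/1/2/None genotype coding, and the `male_min` homozygosity cut-off are hypothetical, and only the 90% call-rate threshold comes from the text.

```python
def call_rate(genotypes):
    """Fraction of markers with a non-missing call (None means no call)."""
    called = sum(g is not None for g in genotypes)
    return called / len(genotypes)


def x_homozygosity(x_genotypes):
    """Fraction of called X-chromosome genotypes that are homozygous."""
    called = [g for g in x_genotypes if g is not None]
    return sum(g != 1 for g in called) / len(called)


def infer_sex(x_genotypes, male_min=0.98):
    # Males carry a single X, so nearly all X genotypes appear homozygous.
    # The 0.98 cut-off is a hypothetical value chosen for illustration.
    return "M" if x_homozygosity(x_genotypes) >= male_min else "F"


def qc_sample(genotypes, x_genotypes, reported_sex, min_call_rate=0.90):
    """Return a list of QC problems for one sample (empty list = pass)."""
    problems = []
    if call_rate(genotypes) < min_call_rate:
        problems.append("low call rate")
    if infer_sex(x_genotypes) != reported_sex:
        problems.append("sex mismatch")
    return problems
```

In practice the X-chromosome genotypes passed to `infer_sex` would first be restricted to SNPs with MAF >0.1, as in the text, so that rare variants do not inflate apparent homozygosity.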

Ancestry estimation 
Reference populations 

We used the participants of the 1KG [13] as a reference
population for ancestry identification. These individuals
derive from 21 different population groups covering five
superpopulations: African (AFR), East Asian (ASN),
South Asian (SAN), European (EUR), and Admixed
American (AMR) (see Additional file 1). We obtained
genetic data for 1,445 unrelated individuals profiled on
the Illumina Omni 2.5 array ([19]), and annotated sites
to dbSNP 135 identifiers using GATK [18]. We focused
on the 41,572 sites that overlapped those on the HumanExome
Array.

Ancestry informative markers

We prioritized markers by their ability to distinguish ancestry
and for being independent of each other. We calculated
informativeness [14] based on the five 1KG
superpopulations (AFR, ASN, SAN, EUR, and AMR)
and chose markers in order of informativeness that were
in linkage equilibrium (r² < 0.2) with previously chosen
markers within 1 Mb. Sites were not filtered based on a
minor allele frequency threshold. We ultimately identified 29,972
markers that were shared between the Omni 2.5 array
and the HumanExome array for use in ancestry
estimation.
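
The selection rule described above (take markers in decreasing order of informativeness, skipping any that are correlated with an already chosen marker within 1 Mb) can be sketched as a greedy loop. The Python below is a minimal illustration under stated assumptions: the marker dictionaries and function names are hypothetical, and r² is approximated by the squared Pearson correlation of genotype dosages rather than computed from phased haplotypes.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype-dosage vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic marker is uncorrelated by convention
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy * sxy / (sxx * syy)


def select_markers(markers, max_r2=0.2, window_bp=1_000_000):
    """Greedily keep informative markers that are not in LD with earlier picks.

    Each marker is a dict with keys: id, chrom, pos, info, dosages.
    """
    chosen = []
    for m in sorted(markers, key=lambda m: m["info"], reverse=True):
        if m["info"] <= 0:
            continue  # keep only sites with positive informativeness
        linked = any(
            c["chrom"] == m["chrom"]
            and abs(c["pos"] - m["pos"]) <= window_bp
            and r_squared(c["dosages"], m["dosages"]) >= max_r2
            for c in chosen
        )
        if not linked:
            chosen.append(m)
    return chosen
```

Because markers are visited from most to least informative, whenever two nearby markers are correlated the more informative of the pair is the one retained.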

To estimate ancestry, we identified for each NCS participant
the most similar 1KG superpopulation group
and population group (Table 1; see Additional file 1).
Specifically, we clustered genotypes from all participants
at ancestry informative markers using MDS in PLINK
[20]. We then developed a model based on the first 20
MDS components using linear discriminant analysis (lda
command in MASS package [21] in R) with the 1KG individuals
as a training set. We next used the NCS individuals
as a test dataset and predicted the most likely
superpopulation and population groups for each participant.
We compared predicted groups with groups based
on self-reported ethnicity and race. Because there was
not a 1:1 relationship between self-reported ethnicity
and race and the 1KG superpopulation groups, we developed
expected relationships (Table 1). Individuals
were considered discordant if their groups disagreed
with the expected superpopulation assignments.
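
To make the train-then-predict step concrete, the following Python sketch fits a linear discriminant model on labeled reference points and ranks the groups for a new individual. It is not the authors' code (the analysis above used the lda command in R): it uses only two components instead of 20 so that the pooled covariance can be inverted by hand, and the function names and coordinates are hypothetical.

```python
import math


def fit_lda(points, labels):
    """Fit a two-feature LDA: class means, class priors, inverse pooled covariance."""
    classes = sorted(set(labels))
    n = len(points)
    means, priors = {}, {}
    for c in classes:
        members = [p for p, lab in zip(points, labels) if lab == c]
        means[c] = (
            sum(p[0] for p in members) / len(members),
            sum(p[1] for p in members) / len(members),
        )
        priors[c] = len(members) / n
    # Pooled within-class covariance, shared by all classes.
    sxx = sxy = syy = 0.0
    for p, lab in zip(points, labels):
        dx, dy = p[0] - means[lab][0], p[1] - means[lab][1]
        sxx += dx * dx
        sxy += dx * dy
        syy += dy * dy
    dof = n - len(classes)
    sxx, sxy, syy = sxx / dof, sxy / dof, syy / dof
    det = sxx * syy - sxy * sxy
    inverse = ((syy / det, -sxy / det), (-sxy / det, sxx / det))
    return means, priors, inverse


def rank_groups(model, point):
    """Return group labels ordered from most to least likely for one point."""
    means, priors, inv = model
    scores = {}
    for c, (mx, my) in means.items():
        wx = inv[0][0] * mx + inv[0][1] * my  # first entry of inverse(S) * mean
        wy = inv[1][0] * mx + inv[1][1] * my  # second entry of inverse(S) * mean
        scores[c] = (
            point[0] * wx + point[1] * wy
            - 0.5 * (mx * wx + my * wy)
            + math.log(priors[c])
        )
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical reference coordinates standing in for two MDS components.
reference = [(-2.0, 0.0), (-2.2, 0.1), (-1.8, -0.1), (-2.0, 0.2),
             (2.0, 0.0), (2.1, 0.2), (1.9, -0.2), (2.0, 0.1),
             (0.0, 3.0), (0.1, 3.2), (-0.1, 2.8), (0.2, 3.0)]
groups = ["AFR"] * 4 + ["EUR"] * 4 + ["ASN"] * 4
model = fit_lda(reference, groups)
ranking = rank_groups(model, (1.8, 0.1))
```

The first entry of `ranking` is the predicted group, and the later entries give the second and third most likely groups, which is the ordering examined for discordant individuals.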

Additional file 



Additional file 1: Table listing the expected groups based on self-report
and the population each individual matches. Genotype and
phenotype data are available through dbGaP:
http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000662.v1.p1.